knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
for (f in list.files(here::here("src"), pattern = "R$", full.names = TRUE)) {
  source(f)
}
# library(conflicted)
library(glue)
library(tidyverse)
theme_set(theme_classic())
light_blue <- "#7AAED1"
dark_blue <- "#011F4B"
In this exercise, you will use the dose-response model from Section 3.7. The likelihood is the same, but instead of uniform priors, we will use a bivariate normal distribution as the joint prior distribution of the parameters \(\alpha\) and \(\beta\).
Below is a description of the bioassay from the reading instructions (with some minor changes to help the grammar):
The example is from Racine et al. (1986) (see ref in the end of the BDA3). [This] Swiss company makes classification[s] of chemicals to different toxicity categories defined by [governmental] authorities (like [the] EU). Toxicity classification is based on [the] lethal dose 50% (LD50) which [indicates] what amount of [a] chemical [that] kills 50% of the subjects. [The] [s]maller the LD50, [the] more lethal the chemical is. The original paper mentions “1983 Swiss poison Regulation” which defines [the] following categories for chemicals orally given to rats (mg/ml):
| Class | LD50 |
|---|---|
| 1 | <5 |
| 2 | 5-50 |
| 3 | 50-500 |
| 4 | 500-2000 |
| 5 | 2000-5000 |
To reduce the number of rats needed in the experiments, the company started to use Bayesian methods. The paper mentions that in those days, [the] use of just 20 rats to define the classification was very little. [The] book gives the LD50 in log(g/ml). When the result from demo3_6 is transformed to mg/ml, we see that the mean LD50 is about 900 and \(p(500 < LD50 < 2000) \approx 0.99\). Thus, the tested chemical can be classified as category 4 toxic.
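As a quick sanity check of the unit conversion mentioned above — a sketch, assuming the LD50 draws are on the log(g/ml) scale as in demo3_6, and using an illustrative value rather than actual posterior draws:

```r
# Sketch of the unit conversion (assumption: LD50 is reported on the
# log(g/ml) scale, as in demo3_6; the value below is illustrative only).
ld50_log_g_per_ml <- -0.105
ld50_mg_per_ml <- exp(ld50_log_g_per_ml) * 1000  # g/ml -> mg/ml
ld50_mg_per_ml  # roughly 900 mg/ml, i.e. class 4 (500-2000)
```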
a) In the prior distribution for \((\alpha, \beta)\), the marginal distributions are \(\alpha \sim N(0, 2^2)\) and \(\beta \sim N(10, 10^2)\), and the correlation between them is \(\text{corr}(\alpha, \beta) = 0.6\). Report the mean (vector of two values) and covariance (two by two matrix) of the bivariate normal distribution.
The definition of a bivariate normal distribution is
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N \begin{bmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} , \begin{pmatrix} \sigma_1^2 & \rho \sigma_2 \sigma_1 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \\ \end{pmatrix} \end{bmatrix} \]
Because \(\alpha\) and \(\beta\) are both marginally normal, the mean vector is
\[ \begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix} = \begin{pmatrix} 0 \\ 10 \end{pmatrix} \]
and the covariance matrix is
\[ \begin{aligned} \begin{pmatrix} \sigma_\alpha^2 & \rho \sigma_\beta \sigma_\alpha \\ \rho \sigma_\alpha \sigma_\beta & \sigma_\beta^2 \\ \end{pmatrix} &= \begin{pmatrix} 2^2 & 0.6 \times 10 \times 2 \\ 0.6 \times 2 \times 10 & 10^2 \\ \end{pmatrix} \\ &= \begin{pmatrix} 4 & 12 \\ 12 & 100 \\ \end{pmatrix} \end{aligned} \]
N <- 1e5
mu_a <- 0
sigma_a <- 2
mu_b <- 10
sigma_b <- 10
rho <- 0.6
cov_value <- rho * sigma_a * sigma_b
mu <- c(mu_a, mu_b)
cov_mat <- matrix(
  c(sigma_a^2, cov_value, cov_value, sigma_b^2),
  nrow = 2,
  byrow = FALSE
)
bvn <- MASS::mvrnorm(N, mu = mu, Sigma = cov_mat)
plot_df <- as.data.frame(bvn) %>%
  as_tibble() %>%
  set_names("alpha", "beta")
plot_points_and_density <- function(df,
                                    x,
                                    y,
                                    n_samples = 1000,
                                    point_alpha = 0.6,
                                    density_bins = 5) {
  df %>%
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point(
      data = sample_n(df, size = min(c(nrow(df), n_samples))),
      alpha = point_alpha,
      color = light_blue
    ) +
    geom_density_2d(bins = density_bins, linetype = 2, color = dark_blue)
}
plot_points_and_density(plot_df, alpha, beta)
Source: PennState, Eberly College of Science: “4.2 - Bivariate Normal Distribution”
b) You are given 4000 independent draws from the posterior distribution of the model. Report the mean as well as 5% and 95% quantiles separately for both \(\alpha\) and \(\beta\). Report also the Monte Carlo standard errors (MCSEs) for the mean and quantile estimates. Explain in words what the Monte Carlo standard error means and how you decided the number of digits to show.
bioassay_posterior <- read_data(
  "bioassay_posterior.txt",
  read_tsv,
  col_names = FALSE
)
colnames(bioassay_posterior) <- c("alpha", "beta")
head(bioassay_posterior)
#> # A tibble: 6 × 2
#> alpha beta
#> <dbl> <dbl>
#> 1 -0.0205 10.0
#> 2 1.22 4.50
#> 3 3.05 16.2
#> 4 1.32 4.92
#> 5 1.36 12.9
#> 6 1.09 5.94
p <- plot_points_and_density(
  bioassay_posterior,
  alpha,
  beta,
  n_samples = Inf,
  point_alpha = 0.3
)
p
The MCSE for the accuracy of averages is \(s_\theta / \sqrt{S}\), where \(s_\theta\) is the standard deviation of the draws and \(S\) is the number of draws.
apply(bioassay_posterior, 2, function(x) sd(x) / sqrt(length(x)))
#> alpha beta
#> 0.01482435 0.07560016
Calculate the marginal posterior means.
print("Posterior means:")
#> [1] "Posterior means:"
apply(bioassay_posterior, 2, mean)
#> alpha beta
#> 0.9852263 10.5964813
Calculate the MCSE of the quantiles.
apply(bioassay_posterior, 2, function(x) {
  unlist(c(
    aaltobda::mcse_quantile(x, 0.05),
    aaltobda::mcse_quantile(x, 0.95)
  ))
})
#> alpha beta
#> mcse 0.02600412 0.07043125
#> mcse 0.04206342 0.24121289
print("Posterior 5% and 95% quantiles:")
#> [1] "Posterior 5% and 95% quantiles:"
apply(bioassay_posterior, 2, quantile, probs = c(0.05, 0.95))
#> alpha beta
#> 5% -0.4675914 3.991403
#> 95% 2.6102028 19.340365
Accounting for the precision implied by the MCSEs:
The MCSE is the expected error arising from the stochasticity of the Monte Carlo simulation. Here, it is used to decide how many digits to report: we keep only the digits that are unaffected by the MCSE, i.e., the decimal places at which the rounded MCSE is 0.
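One way to mechanize this digit rule is a small helper (hypothetical, not part of the course code) that rounds an estimate to the decimal places at which its MCSE rounds to zero:

```r
# Hypothetical helper: round an estimate so that only the digits
# unaffected by the Monte Carlo standard error are reported.
round_by_mcse <- function(estimate, mcse) {
  # Number of decimal places at which the MCSE rounds to 0.
  digits <- max(0, floor(-log10(mcse)))
  round(estimate, digits)
}

round_by_mcse(0.9852263, 0.0148)   # alpha mean  -> 1.0
round_by_mcse(10.5964813, 0.0756)  # beta mean   -> 10.6
round_by_mcse(19.340365, 0.2412)   # beta 95% q. -> 19
```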
p +
  geom_vline(xintercept = 1.0, color = "red") +
  geom_ribbon(aes(xmin = -0.5, xmax = 2.6), alpha = 0.1, fill = "red") +
  geom_hline(yintercept = 10.6, color = "orange") +
  geom_ribbon(aes(ymin = 4.0, ymax = 19), alpha = 0.1, fill = "orange") +
  scale_x_continuous(expand = expansion(0)) +
  scale_y_continuous(expand = expansion(0))
c) Implement a function for computing the log importance ratios (log importance weights) when the importance sampling target distribution is the posterior distribution, and the proposal distribution is the prior distribution from question a. Explain in words why it’s better to compute log ratios instead of ratios.
It is better to compute log ratios to avoid numerical overflow and underflow, which occur easily on the raw ratio scale when many small (or large) probabilities are multiplied together.
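A quick numeric illustration of the problem in plain R (independent of the bioassay model):

```r
# Likelihoods of many observations quickly underflow on the raw scale:
log_terms <- rep(-800, 100)   # per-observation log-likelihoods
prod(exp(log_terms))          # 0 -- each exp(-800) already underflows
sum(log_terms)                # -80000 -- exact in log space

# The same applies to ratios:
exp(-1000) / exp(-1001)       # NaN (0 / 0)
exp(-1000 - (-1001))          # exp(1), computed without trouble
```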
As hinted in the assignment (see the PDF), because we use the prior distribution as the proposal distribution, the calculation of the importance weights simplifies considerably.
The importance weight \(w(\theta^s)\) is calculated as follows:
\[ w(\theta^s) = \frac{q(\theta^s|y)}{g(\theta^s)} \]
With the proposal distribution \(g\) set as the prior distribution, \(g(\theta) = p(\theta)\), the equation simplifies to:
\[
\begin{aligned}
w(\theta^s) &= \frac{q(\theta^s|y)}{g(\theta^s)} \\
&= \frac{p(\theta^s) p(y|\theta^s)}{p(\theta^s)} \\
&= p(y|\theta^s)
\end{aligned}
\] Thus, the importance weight is just the likelihood of the model. In the question (again, see the PDF), we are asked to use aaltobda::bioassaylp() to calculate the logarithm of the likelihood.
log_importance_weights <- function(alpha, beta) {
  bioassay_data <- read_bioassay_data(show_col_types = FALSE)
  aaltobda::bioassaylp(
    alpha = alpha,
    beta = beta,
    x = bioassay_data$x,
    y = bioassay_data$y,
    n = bioassay_data$n
  )
}
# Test data.
alpha <- c(1.896, -3.6, 0.374, 0.964, -3.123, -1.581)
beta <- c(24.76, 20.04, 6.15, 18.65, 8.16, 17.4)
test_results <- round(log_importance_weights(alpha, beta), 2)
stopifnot(all(close_to(
  test_results,
  c(-8.95, -23.47, -6.02, -8.13, -16.61, -14.57)
)))